Abstract: The theoretical basis for a candidate variational principle for the information bottleneck (IB) method is formulated within the ambit of the generalized nonadditive statistics of Tsallis. Given a nonadditivity parameter $q$, the role of the additive duality of nonadditive statistics ($q^* = 2 - q$) in relating Tsallis entropies for ranges of the nonadditivity parameter $q < 1$ and $q > 1$ is described. Defining $X$, $\tilde{X}$, and $Y$ to be the source alphabet, the compressed reproduction alphabet, and the relevance variable respectively, it is demonstrated that minimization of a generalized IB (gIB) Lagrangian defined in terms of the nonadditivity parameter $q^*$ self-consistently yields the nonadditive effective distortion measure to be the $q$-deformed generalized Kullback-Leibler divergence $D^{q}_{K-L}[p(Y|X)\|p(Y|\tilde{X})]$. This result is achieved without enforcing any a-priori assumptions. Next, it is proven that the $q^*$-deformed nonadditive free energy of the system is non-negative and convex. Finally, the update equations for the gIB method are derived. These results generalize critical features of the IB method to the case of Tsallis statistics.

I. Introduction 

Rate distortion (RD) theory [1,2] is a major branch of information theory which provides the theoretical foundations for lossy data compression. RD theory addresses the problem of determining the minimal amount of entropy (or information) $R$ that should be communicated over a channel, so that the source (input signal/source alphabet/codebook) $X \in \mathcal{X}$ can be approximately reconstructed at the receiver (output signal/reproduction alphabet/quantized codebook) $\tilde{X} \in \tilde{\mathcal{X}}$ without exceeding a given expected distortion $D$. Note that calligraphic fonts are used to denote sets. In turn, the information bottleneck (IB) method is a technique introduced by Tishby, Pereira, and Bialek [3,4] for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a discrete random variable $X$, given a joint probability distribution $p(x,y)$ between $X$ and a relevance variable $Y \in \mathcal{Y}$. In this regard, the IB method represents a significant qualitative improvement over RD theory. The IB method has acquired immense utility in machine learning theory. For example, the IB method and its modifications have successfully been employed in applications in diverse areas such as genome sequence analysis, astrophysics, and text mining [4]. $q$-deformed (or Tsallis) statistics [5,6] has recently been shown to yield interesting improvements concerning RD theory [7]. The present paper extends analogous "$q$-" considerations to the IB method.

The generalized (nonadditive) statistics of Tsallis has recently been the focus of much attention in statistical physics and allied disciplines. Note that the terms generalized statistics, $q$-deformed statistics, nonadditive statistics, and nonextensive statistics are used interchangeably. Nonadditive statistics, which generalizes the Boltzmann-Gibbs-Shannon (B-G-S) statistics, has recently found much utility in a wide spectrum of disciplines ranging from complex systems and condensed matter physics to financial mathematics. A continually updated bibliography of works in nonadditive statistics may be found at http://tsallis.cat.cbpf.br/biblio.htm

Since the work on nonextensive source coding by Landsberg 
and Vedral [8], a number of studies on the information 
theoretic aspects of generalized statistics pertinent to coding 
related problems have been performed [9-12]. Most recently, 
the nonadditive statistics of Tsallis [5,6] has been utilized to 
develop a generalized statistics RD theory [7]. This paper [7] 
investigates nonadditive statistics within the context of RD 
theory in lossy data compression. The generalized statistics 
RD model performs variational minimization of the nonaddi- 
tive RD Lagrangian employing a method developed [13] to 
"rescue" the linear constraints originally employed by Tsallis 
[5]. Nonadditive statistics possesses a number of constraints 
having different forms [14-16]. 

RD theory is now briefly described as a precursor to introducing the leitmotif for the IB method. For a source alphabet $X \in \mathcal{X}$ and a reproduction alphabet $\tilde{X} \in \tilde{\mathcal{X}}$, the mapping of $x \in \mathcal{X}$ to $\tilde{x} \in \tilde{\mathcal{X}}$ is characterized by the quantizer $p(\tilde{x}|x)$. The RD function is obtained by minimizing the generalized mutual information (GMI) $I_q(X;\tilde{X})$ (defined in Section 2, [7]) over all normalized $p(\tilde{x}|x)$.$^1$ In RD theory $I_q(X;\tilde{X})$ is known as the compression information (see Section 3, [7]). Here, $q$ is the nonadditivity parameter [5,6]. A significant feature of the nonadditive RD model [7] is that the threshold for the compression information is lower than that encountered in RD models derived from B-G-S statistics. This feature augurs well for utilizing Tsallis statistics in data compression applications.



1 The absence of a definitive nonadditive channel coding theorem sometimes 
prompts the use of the term generalized mutual entropy instead of GMI [10]. 



By definition, the nonadditive RD function is [1,2,7]

$$R_q(D) = \min_{p(\tilde{x}|x):\ \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})} \le D} I_q(X;\tilde{X}); \quad 0 < q < 1, \tag{1}$$

where $R_q(D)$ is the minimum of the compression information. The distortion measure is denoted by $d(x,\tilde{x})$ and is taken to be the Euclidean square distance for most problems in science and engineering [1,2]. Given $d(x,\tilde{x})$, the partitioning of $X$ induced by $p(\tilde{x}|x)$ has an expected distortion $D = \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})}$. Note that in this paper, $\langle\bullet\rangle_{p(\bullet)}$ denotes the expectation with respect to the probability $p(\bullet)$. RD theory a-priori specifies the nature of the distortion measure, which is tantamount to an a-priori specification of the features of interest in the source alphabet $X$ to be contained in the compressed representation $\tilde{X}$. RD theory lacks the framework to specify the features of interest in $X$ to be contained in $\tilde{X}$ that are relevant to a given study. To ameliorate this drawback, the IB method introduces another variable $Y \in \mathcal{Y}$, the relevance variable. Note that $Y$ need not inhabit the same space as $X$.

Thus, the crux of the IB method is to simultaneously minimize the compression information $I_q(X;\tilde{X})$ and maximize the relevant information $I_q(\tilde{X};Y)$. More specifically, the IB method extracts structure from the source alphabet via data compression, followed by a quantification of the information contained in the extracted structure with respect to a relevance variable. Consequently, the IB method "squeezes" the information between $X$ and $Y$ through a bottleneck $\tilde{X}$. The IB method is compactly described by the Markov condition [3,4]

$$\tilde{X} \leftrightarrow X \leftrightarrow Y. \tag{2}$$



As discussed in [7], the un-normalized GMI in Tsallis statistics acquires different forms in the regimes $0 < q < 1$ and $q > 1$, respectively. For example, for $0 < q < 1$, the GMI is of the form $I_{0<q<1}(X;\tilde{X}) = -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)$. For $q > 1$, the GMI is defined by $I_{q>1}(X;\tilde{X}) = S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X})$, where $S_q(X)$ and $S_q(\tilde{X})$ are the marginal Tsallis entropies for the random variables $X$ and $\tilde{X}$, and $S_q(X,\tilde{X})$ is the joint Tsallis entropy [7]. Unlike the B-G-S case, $I_{0<q<1}(X;\tilde{X})$ can never acquire the form of $I_{q>1}(X;\tilde{X})$, and vice versa [7,9,10], the reason being that the sub-additivities $S_q(X|\tilde{X}) \le S_q(X)$ and $S_q(\tilde{X}|X) \le S_q(\tilde{X})$ are not generally valid when $0 < q < 1$. The form of $I_{0<q<1}(X;\tilde{X})$ is important in a number of applications of practical interest in coding theory and learning theory, where it is desirable that the GMI be expressed as the generalized Kullback-Leibler divergence (K-Ld) between the joint probability $p(x,\tilde{x})$ and the marginal probabilities $p(x)$ and $p(\tilde{x})$ [1]. On the other hand, un-normalized Tsallis entropies for $q > 1$ possess a number of important properties such as the generalized data processing inequality and the generalized Fano inequality [10]. The different forms of the GMI for $0 < q < 1$ and $q > 1$ are reconciled by invoking the additive duality of nonadditive statistics [17]. This entails a re-parameterization of the nonadditivity parameter $q^* = 2 - q$, resulting in dual Tsallis entropies.



This paper derives a theoretical basis for the generalized IB (gIB) method, which is a fundamental qualitative extension of the seminal work of Tishby, Pereira, and Bialek [3]. This analysis commences with the minimization of the gIB Lagrangian

$$L^{q}_{gIB}[p(\tilde{x}|x)] = I_q(X;\tilde{X}) - \beta_{gIB}\, I_q(\tilde{X};Y); \quad q > 1, \tag{3}$$

subject to the normalization of $p(\tilde{x}|x)$. Here, $\beta_{gIB}$ is the gIB tradeoff parameter for the simultaneous minimization and maximization described by (3). From (3), it is easily shown that $\delta L^{q}_{gIB}[p(\tilde{x}|x)] = 0 \Rightarrow \frac{\delta I_q(\tilde{X};Y)}{\delta I_q(X;\tilde{X})} = \beta_{gIB}^{-1}$. Thus, by increasing $\beta_{gIB}$, convex curves akin to the RD curves [1,2,7] may be constructed in the "information plane" $(I_q(X;\tilde{X}), I_q(\tilde{X};Y))$. These are called relevance-compression curves [4].

Apart from its ability to model long-range interactions when 
performing clustering of complex data sets, the gIB method 
also facilitates the analysis of the IB method within the context 
of predictability [18]. Predictability may be viewed as an 
excursion from the extensive B-G-S statistics, and is inherently 
nonextensive (nonadditive). 



II. Tsallis Entropies and Dual Tsallis Entropies 

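The un-normalized Tsallis entropy, conditional Tsallis entropy, joint Tsallis entropy, the jointly convex generalized K-Ld, and the GMI may thus be written as [19,20]

$$\begin{aligned}
S_q(X) &= -\sum_x p(x)^q \ln_q p(x), \\
S_q(\tilde{X}|X) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q p(\tilde{x}|x), \\
S_q(X,\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q p(x,\tilde{x}) = S_q(X) + S_q(\tilde{X}|X) = S_q(\tilde{X}) + S_q(X|\tilde{X}), \\
D^{q}_{K-L}(p(X)\|r(X)) &= -\sum_x p(x) \ln_q\left(\frac{r(x)}{p(x)}\right), \\
I_q(X;\tilde{X}) &= -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) = D^{q}_{K-L}(p(x,\tilde{x})\|p(x)p(\tilde{x})),
\end{aligned} \tag{4}$$

respectively. The $q$-deformed logarithm and the $q$-deformed exponential are defined for $0 < q < 1$ as [21]

$$\ln_q(x) = \frac{x^{1-q} - 1}{1-q}, \quad \text{and,} \quad \exp_q(x) = \begin{cases} [1 + (1-q)x]^{\frac{1}{1-q}}; & 1 + (1-q)x > 0, \\ 0; & \text{otherwise.} \end{cases} \tag{5}$$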
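As a quick numerical illustration of (5), the following minimal Python sketch implements the $q$-deformed logarithm and exponential. The function names `ln_q` and `exp_q` are illustrative choices rather than notation from [21], and the logarithm is assumed to be called with $x > 0$ only.

```python
import math

def ln_q(x, q):
    """q-deformed logarithm of (5); falls back to math.log as q -> 1. Assumes x > 0."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential of (5); returns 0 when 1 + (1 - q) x is not positive."""
    if abs(1.0 - q) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return 0.0
    return base ** (1.0 / (1.0 - q))

q = 0.7
round_trip = exp_q(ln_q(2.5, q), q)      # exp_q inverts ln_q on its domain
near_shannon = ln_q(2.5, 0.999999)       # approaches the natural logarithm
cut_off = exp_q(-10.0, 0.5)              # 1 + 0.5 * (-10) < 0, so the cut-off applies
```

The explicit branch for $q \to 1$ is only a numerical convenience to avoid dividing by a vanishing $1 - q$.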

The operations of $q$-deformed relations are governed by $q$-algebra and $q$-calculus [21]. Apart from providing an analogy to equivalent expressions derived from B-G-S statistics, $q$-algebra and $q$-calculus endow generalized statistics with a unique information geometric structure. Salient results of $q$-algebra employed in this paper, involving the $q$-deformed addition ($\oplus_q$) and subtraction ($\ominus_q$), are [21]

$$\begin{aligned}
x \oplus_q y &= x + y + (1-q)xy, \\
x \ominus_q y &= \frac{x - y}{1 + (1-q)y}, \quad \text{where,} \quad \ominus_q y = \frac{-y}{1 + (1-q)y}, \\
\ln_q(xy) &= \ln_q(x) \oplus_q \ln_q(y) = \ln_q(x) + \ln_q(y) + (1-q)\ln_q(x)\ln_q(y), \\
\ln_q\left(\frac{x}{y}\right) &= \ln_q(x) \ominus_q \ln_q(y) = y^{q-1}\left(\ln_q(x) - \ln_q(y)\right).
\end{aligned} \tag{6}$$
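The identities in (6) lend themselves to a direct numerical check. The sketch below is illustrative only: the helper names `q_add` and `q_sub` are hypothetical, and $q \ne 1$ with positive arguments is assumed.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1 and x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_add(a, b, q):
    """q-deformed addition: a (+)_q b = a + b + (1 - q) a b."""
    return a + b + (1.0 - q) * a * b

def q_sub(a, b, q):
    """q-deformed subtraction: a (-)_q b = (a - b) / (1 + (1 - q) b)."""
    return (a - b) / (1.0 + (1.0 - q) * b)

q, x, y = 0.6, 3.0, 1.7
# Each gap should vanish up to floating-point rounding.
product_gap = ln_q(x * y, q) - q_add(ln_q(x, q), ln_q(y, q), q)
ratio_gap = ln_q(x / y, q) - q_sub(ln_q(x, q), ln_q(y, q), q)
scaled_gap = ln_q(x / y, q) - y ** (q - 1.0) * (ln_q(x, q) - ln_q(y, q))
```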

Given two independent variables $X$ and $Y$, one of the fundamental consequences of nonadditivity of the Tsallis entropy is the pseudo-additivity relation

$$S_q(XY) = S_q(X) + S_q(Y) + (1-q)\,S_q(X)\,S_q(Y). \tag{7}$$
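The pseudo-additivity relation (7) can be verified on a small example. The following sketch assumes the closed form $S_q = (1 - \sum_i p_i^q)/(q-1)$, which follows from the definition of $S_q$ in (4) together with (5); the distributions are arbitrary illustrative values.

```python
def tsallis_entropy(p, q):
    """Un-normalized Tsallis entropy (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

q = 1.5
px = [0.2, 0.5, 0.3]
py = [0.6, 0.4]
joint = [a * b for a in px for b in py]      # joint law of independent X and Y

lhs = tsallis_entropy(joint, q)
sx, sy = tsallis_entropy(px, q), tsallis_entropy(py, q)
rhs = sx + sy + (1.0 - q) * sx * sy          # right-hand side of (7)
```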

Re-parameterizing (5) via the additive duality $q^* = 2 - q$ yields the dual deformed logarithm and exponential

$$\ln_{q^*}(x) = -\ln_q\left(\frac{1}{x}\right), \quad \text{and,} \quad \exp_{q^*}(x) = \frac{1}{\exp_q(-x)}. \tag{8}$$

A dual Tsallis entropy is defined by

$$S_{q^*}(X) = -\sum_x p(x) \ln_{q^*} p(x). \tag{9}$$
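The duality relations (8) and (9) can be checked numerically. The sketch below, under the assumption $q \ne 1$, confirms that $\ln_{q^*}(x) = -\ln_q(1/x)$ and that the dual entropy (9) evaluates to the same number as $S_q(X)$ in (4); the probability vector is an arbitrary illustrative choice.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1 and x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

q = 1.4
q_star = 2.0 - q                               # additive duality
p = [0.1, 0.25, 0.65]

# Largest violation of ln_{q*}(x) = -ln_q(1/x) over a few sample points.
dual_log_gap = max(abs(ln_q(v, q_star) + ln_q(1.0 / v, q)) for v in (0.2, 1.0, 3.5))

s_q = -sum(pi ** q * ln_q(pi, q) for pi in p)       # S_q(X) as in (4)
s_dual = -sum(pi * ln_q(pi, q_star) for pi in p)    # dual form (9)
```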

The dual Tsallis joint entropy obeys the relation

$$S_{q^*}(X,\tilde{X}) = S_{q^*}(X) + S_{q^*}(\tilde{X}|X), \quad \text{where,} \quad S_{q^*}(\tilde{X}|X) = -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*} p(\tilde{x}|x). \tag{10}$$

Here, $\ln_{q^*}(x) = \frac{x^{1-q^*} - 1}{1-q^*}$. The dual Tsallis entropies acquire a form identical to the B-G-S entropies, with $\ln_{q^*}(\bullet)$ replacing $\log(\bullet)$. The GMIs $I_{q>1}(X;\tilde{X})$ and $I_{0<q^*<1}(X;\tilde{X})$, defined by the nonadditivity parameters $q > 1$ and $0 < q^* < 1$ respectively, relate to each other as (Theorem 3, [7])

$$I_{q^*}(X;\tilde{X}) = -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) \overset{(q^* \to q)}{=} S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X}) = I_q(X;\tilde{X}). \tag{11}$$

Here, "$q^* \to q$" is a re-parameterization from $q^*$ to $q$, and "$q \to q^*$" is a re-parameterization from $q$ to $q^*$.
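The K-Ld form of the GMI on the left-hand side of (11) is straightforward to evaluate for a small joint distribution. The sketch below is illustrative: the function name `gmi_kl_form` and the example matrices are assumptions, the joint distribution is stored as a nested list indexed `[x][x_tilde]`, and only the K-Ld form in $q^*$-space is computed.

```python
import math

def ln_q(x, q):
    """q-deformed logarithm; uses math.log when q is numerically 1."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def gmi_kl_form(joint, q_star):
    """-sum p(x,t) ln_{q*}( p(x) p(t) / p(x,t) ), with t standing for x-tilde."""
    p_x = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0.0:                       # zero cells contribute nothing
                total -= p_ij * ln_q(p_x[i] * p_t[j] / p_ij, q_star)
    return total

joint = [[0.30, 0.10], [0.05, 0.55]]
independent = [[0.12, 0.28], [0.18, 0.42]]       # product of (0.4, 0.6) and (0.3, 0.7)

i_dual = gmi_kl_form(joint, 0.8)                 # 0 < q* < 1
i_shannon = gmi_kl_form(joint, 1.0)              # B-G-S mutual information
i_near = gmi_kl_form(joint, 0.999999)            # approaches i_shannon
i_zero = gmi_kl_form(independent, 0.8)           # vanishes for independent variables
```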

III. Generalized Information Bottleneck 
Variational Principle 

A. Self-consistent equations 

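Depending upon the "upstream" and "downstream" variables in the Markov condition (2), the total probability may be expressed as

$$\begin{aligned}
\tilde{X} \leftarrow X \leftarrow Y &\Rightarrow p(x,\tilde{x},y) = p(x,y)\,p(\tilde{x}|x), \\
\tilde{X} \rightarrow X \rightarrow Y &\Rightarrow p(x,\tilde{x},y) = p(x,\tilde{x})\,p(y|x).
\end{aligned} \tag{12}$$

The Markov condition $\tilde{X} \leftarrow X \leftarrow Y$ yields [3]

$$p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_x p(y|x)\,p(\tilde{x}|x)\,p(x). \tag{13}$$

Since $\tilde{X} \leftarrow X \leftarrow Y \equiv Y \rightarrow X \rightarrow \tilde{X}$, the Markov condition yields through application of Bayes rule and consistency [3]

$$p(y|\tilde{x}) = \sum_{x \in \mathcal{X}} p(y|x)\,p(x|\tilde{x}). \tag{14}$$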

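Thus

$$\begin{aligned}
p(\tilde{x}) &= \sum_{x,y} p(x,\tilde{x},y) = \sum_x p(x)\,p(\tilde{x}|x), \\
\text{and,} \quad p(\tilde{x},y) &= \sum_x p(x,\tilde{x},y) = \sum_x p(x,y)\,p(\tilde{x}|x).
\end{aligned} \tag{15}$$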
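The marginalizations in (13) and (15) translate directly into code. The following sketch assumes the joint distribution $p(x,y)$ and the encoder $p(\tilde{x}|x)$ are stored as nested lists; the function name `bottleneck_marginals` and the numerical values are hypothetical, and every $p(\tilde{x})$ is assumed to be strictly positive.

```python
def bottleneck_marginals(p_xy, enc):
    """Return p(t) and p(y|t) from p_xy[x][y] and enc[x][t] = p(t|x), t = x-tilde."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    # First line of (15): p(t) = sum_x p(x) p(t|x)
    p_t = [sum(p_x[x] * enc[x][t] for x in range(n_x)) for t in range(n_t)]
    # Second line of (15): p(t, y) = sum_x p(x, y) p(t|x)
    p_ty = [[sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) for y in range(n_y)]
            for t in range(n_t)]
    # Conditional p(y|t) = p(t, y) / p(t), equivalent to (13)
    p_y_given_t = [[p_ty[t][y] / p_t[t] for y in range(n_y)] for t in range(n_t)]
    return p_t, p_y_given_t

p_xy = [[0.30, 0.10], [0.10, 0.20], [0.05, 0.25]]
enc = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
p_t, p_y_given_t = bottleneck_marginals(p_xy, enc)
```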



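From (15), the following relations are obtained

$$\frac{\delta p(\tilde{x})}{\delta p(\tilde{x}|x)} = p(x), \quad \text{and,} \quad \frac{\delta p(\tilde{x}|y)}{\delta p(\tilde{x}|x)} = p(x|y). \tag{16}$$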

B. The variational principle 

The gIB Lagrangian (3) cannot be expressed in terms of the generalized K-Ld. As discussed in [7], the additive duality is required to express the GMI for $q > 1$ in terms of the generalized K-Ld. This is required to formulate nonadditive numerical schemes akin to the EM algorithm [22], using the alternating minimization method based on the Csiszár-Tusnády theory [23]. The gIB Lagrangian in $q^*$-space is

$$L^{q^*}_{gIB}[p(\tilde{x}|x)] = I_{q^*}(X;\tilde{X}) - \beta_{gIB}\, I_{q^*}(\tilde{X};Y); \quad 0 < q^* < 1, \tag{17}$$

contingent upon the normalization of $p(\tilde{x}|x)$. Here, $I_{q^*}(X;\tilde{X})$ and $I_{q^*}(\tilde{X};Y)$ are obtained from $I_q(X;\tilde{X})$ and $I_q(\tilde{X};Y)$ employing (11). Variational minimization of (17) [7,13] yields



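$$\frac{\delta L^{q^*}_{gIB}[p(\tilde{x}|x)]}{\delta p(\tilde{x}|x)} \overset{(a)}{=} \frac{q^*}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} - \frac{q^*\beta_{gIB}}{q^*-1} \sum_y p(y|x)\left(\frac{p(y)}{p(y|\tilde{x})}\right)^{1-q^*} + \frac{\lambda(x)}{p(x)} = 0. \tag{18}$$

Here, (a) is from Bayes' theorem $\frac{p(\tilde{x}|y)}{p(\tilde{x})} = \frac{p(y|\tilde{x})}{p(y)}$ and (16), and $\lambda(x)$ is the Lagrange multiplier for the normalization of $p(\tilde{x}|x)$. The term $p(x)$ is canceled out. A $\ln_{q^*}(\bullet)$ term is introduced in (18) by adding and subtracting $\frac{q^*\beta_{gIB}}{q^*-1}\sum_y p(y|x)$, to yield

$$\frac{q^*}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + q^*\beta_{gIB} \sum_y p(y|x) \ln_{q^*}\left(\frac{p(y)}{p(y|\tilde{x})}\right) - \tilde{\lambda}^{(1)}(x) = 0. \tag{19}$$

In (19), $-\tilde{\lambda}^{(1)}(x) = \frac{\lambda(x)}{p(x)} + \frac{q^*\beta_{gIB}}{1-q^*}\sum_y p(y|x)$, which is only dependent on $x$. The second term in (19) is expressed as $q^*\beta_{gIB}\sum_y p(y|x) \ln_{q^*}\left(\frac{p(y|x)}{p(y|\tilde{x})}\,\frac{p(y)}{p(y|x)}\right)$ and expanded employing the $q^*$-deformed subtraction and addition (6), yielding

$$\frac{q^*}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + q^*\beta_{gIB} \sum_y p(y|x) \ln_{q^*}\left(\frac{p(y|x)}{p(y|\tilde{x})}\right) - \tilde{\lambda}^{(2)}(x) = 0. \tag{20}$$

Here, $\tilde{\lambda}^{(2)}(x)$ absorbs $\tilde{\lambda}^{(1)}(x)$ together with the terms arising from $I_{q^*}(x;Y) = -\sum_y p(y|x) \ln_{q^*}\left(\frac{p(y)}{p(y|x)}\right)$. Multiplying (20) by $p(\tilde{x}|x)$ and summing over $\tilde{x}$ yields

$$\tilde{\lambda}^{(2)}(x) = \frac{q^*}{q^*-1} \sum_{\tilde{x}} p(\tilde{x}|x)\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + q^*\beta_{gIB} \left\langle \sum_y p(y|x) \ln_{q^*}\left(\frac{p(y|x)}{p(y|\tilde{x})}\right) \right\rangle_{p(\tilde{x}|x)}. \tag{21}$$

Defining $\Im_{gIB}(x) = \frac{(q^*-1)}{q^*}\tilde{\lambda}^{(2)}(x)$, (20) yields

$$p(\tilde{x}|x) = \frac{p(\tilde{x})\left[1 - (q^*-1)\frac{\beta_{gIB}}{\Im_{gIB}(x)} \sum_y p(y|x) \ln_{q^*}\left(\frac{p(y|x)}{p(y|\tilde{x})}\right)\right]^{\frac{1}{q^*-1}}}{\Im_{gIB}(x)^{\frac{1}{1-q^*}}}. \tag{22}$$

Setting $\frac{\beta_{gIB}}{\Im_{gIB}(x)} = \beta_{gIB}(x)$, and invoking the additive duality in the numerator ($q^* = 2 - q$), (22) yields the canonical transition probability

$$p(\tilde{x}|x) = \frac{p(\tilde{x}) \exp_q\left[-\beta_{gIB}(x)\, D^{q}_{K-L}[p(y|x)\|p(y|\tilde{x})]\right]}{Z(x,\beta_{gIB}(x))}, \quad Z(x,\beta_{gIB}(x)) = \Im_{gIB}(x)^{\frac{1}{1-q^*}}. \tag{23}$$

In (23), $\beta_{gIB}(x)$ is the gIB tradeoff parameter evaluated for each source alphabet $x \in \mathcal{X}$, and $Z(x,\beta_{gIB}(x))$ is the partition function. The effective distortion measure has been self-consistently obtained via the variational principle to be $D^{q}_{K-L}[p(y|x)\|p(y|\tilde{x})]$, without any a-priori assumptions. In the limit $q \to 1$ the B-G-S statistics result [3,4] is recovered. Solutions of (23) are valid only for $\left[1 - (1-q)\beta_{gIB}(x)\, D^{q}_{K-L}[p(y|x)\|p(y|\tilde{x})]\right] > 0$. The condition $\left[1 - (1-q)\beta_{gIB}(x)\, D^{q}_{K-L}[p(y|x)\|p(y|\tilde{x})]\right] < 0$ is called the Tsallis cut-off condition [5], and requires setting $p(\tilde{x}|x) = 0$ and stopping the iteration at the given $\beta_{gIB}$.

C. Free energy of the system

The $q^*$-deformed nonadditive free energy of the system is [3]

$$F^{q^*}_{gIB}[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})] = F^{q^*}_{gIB}[\bullet] = -\left\langle \ln_{q^*} Z(x,\beta_{gIB}(x)) \right\rangle_{p(x)} = \left\langle \frac{\Im_{gIB}(x) - 1}{q^*-1} \right\rangle_{p(x)}. \tag{24}$$

Here, $[\bullet]$ denotes the arguments $[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})]$ of the free energy. Note $F^{q^*}_{gIB}[\bullet] = \beta_{gIB} F^{Helmholtz}_{q^*}$, where $F^{Helmholtz}_{q^*}$ is the $q^*$-deformed gIB Helmholtz free energy. Invoking (4) and (21), (24) yields

$$\begin{aligned}
F^{q^*}_{gIB}[\bullet] &= -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right) + \beta_{gIB} \left\langle \sum_y p(y|x) \ln_{q^*}\left(\frac{p(y|x)}{p(y|\tilde{x})}\right) \right\rangle_{p(x,\tilde{x})} \\
&\overset{(a)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) - \beta_{gIB} \sum_{x,\tilde{x}} p(x,\tilde{x}) \sum_y p(y|x) \ln_q\left(\frac{p(y|\tilde{x})}{p(y|x)}\right) \\
&\overset{(b)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) - \beta_{gIB} \sum_{x,\tilde{x},y} p(x,\tilde{x},y) \ln_q\left(\frac{p(x,\tilde{x})\,p(y|\tilde{x})}{p(x,\tilde{x},y)}\right) \\
&= D^{q^*}_{K-L}[p(x,\tilde{x})\|p(x)p(\tilde{x})] + \beta_{gIB}\, D^{q}_{K-L}[p(x,\tilde{x},y)\|p(x,\tilde{x})\,p(y|\tilde{x})] \\
&= I_{q^*}(X;\tilde{X}) + \beta_{gIB} \langle D_{eff} \rangle_{p(x,\tilde{x})}.
\end{aligned} \tag{25}$$

Here, (a) invokes the additive duality in the second term in order to introduce the expected effective distortion in $q$-space, and (b) invokes (12) to obtain the total probability $p(x,\tilde{x},y)$. In (25), $F^{q^*}_{gIB}[\bullet]$ is the sum of two generalized K-Ld's having nonadditivity parameters $q^*$ and $q$, where $0 < q^* < 1$ and $q > 1$. From [24], it readily follows that $F^{q^*}_{gIB}[\bullet]$ is non-negative and convex. The expected effective distortion term in (25) is related to the relevant information as

$$\begin{aligned}
\beta_{gIB} \langle D_{eff} \rangle_{p(x,\tilde{x})} &= -\beta_{gIB} \sum_{x,\tilde{x}} p(x,\tilde{x}) \sum_y p(y|x) \ln_q\left(\frac{p(y|\tilde{x})}{p(y|x)}\right) \\
&\overset{(a)}{=} -\beta_{gIB} \sum_{x,\tilde{x},y} p(x,\tilde{x},y)^q \left[\ln_q p(y|\tilde{x}) - \ln_q p(y|x)\right] \\
&\overset{(b)}{=} \beta_{gIB}\left[I_q(X;Y) - I_q(\tilde{X};Y)\right] \\
\Rightarrow \langle D_{eff} \rangle_{p(x,\tilde{x})} &= I_q(X;Y) - I_q(\tilde{X};Y).
\end{aligned} \tag{26}$$

Note $I_q(\tilde{X};Y) \le I_q(X;Y)$ by the generalized data processing inequality [10]. Here, (a) employs $\ln_q\left(\frac{x}{y}\right) = y^{q-1}\left(\ln_q(x) - \ln_q(y)\right)$ in (6), and (b) adds and subtracts $S_q(Y)$, and invokes (11) and the symmetry of the GMI: $I_q(Y;X) = I_q(X;Y)$; $I_q(Y;\tilde{X}) = I_q(\tilde{X};Y)$. From (22)-(24), an empirical criterion equivalent to the Tsallis cut-off condition, described in terms of the gIB free energy for any $x \in \mathcal{X}$, is: $1 + (q^*-1)\,F^{q^*}_{gIB}[\bullet](x) < 0$.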
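The canonical transition probability (23) can be illustrated for a single source symbol $x$. The sketch below is a simplified reading under stated assumptions: $\beta_{gIB}(x)$ is supplied as a plain number `beta_x` rather than computed from $\Im_{gIB}(x)$, the partition function is obtained by explicit normalization over $\tilde{x}$, and the names `kl_q` and `encoder_row` as well as all numerical values are hypothetical.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1 and x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential for q != 1, zero under the Tsallis cut-off."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def kl_q(p, r, q):
    """Generalized K-Ld: -sum_i p_i ln_q(r_i / p_i); assumes r_i > 0 where p_i > 0."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r) if pi > 0.0)

def encoder_row(p_t, p_y_given_x, p_y_given_t, beta_x, q):
    """Normalized p(t|x) for one x: p(t) exp_q(-beta_x * D[p(y|x) || p(y|t)])."""
    weights = [p_t[t] * exp_q(-beta_x * kl_q(p_y_given_x, p_y_given_t[t], q), q)
               for t in range(len(p_t))]
    z = sum(weights)                     # partition function by normalization
    return [w / z for w in weights]

q = 1.5
p_t = [0.5, 0.5]
p_y_given_x = [0.8, 0.2]
p_y_given_t = [[0.7, 0.3], [0.2, 0.8]]
row = encoder_row(p_t, p_y_given_x, p_y_given_t, beta_x=5.0, q=q)
self_divergence = kl_q(p_y_given_x, p_y_given_x, q)
```

In this example the first cluster has the smaller effective distortion, so it receives the larger share of the probability mass.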

IV. The Update Equations 

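Lemma 1: Given a joint distribution $p(x)p(\tilde{x}|x)$, the distribution $p(\tilde{x})$ that minimizes $D^{q^*}_{K-L}[p(X)p(\tilde{X}|X)\|p(X)p(\tilde{X})]$ is the marginal $p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x)$, i.e.

$$D^{q^*}_{K-L}[p(X)p(\tilde{X}|X)\|p(X)p^*(\tilde{X})] = \min_{p(\tilde{x})} D^{q^*}_{K-L}[p(X)p(\tilde{X}|X)\|p(X)p(\tilde{X})]. \tag{27}$$

Also

$$\left\langle D^{q}_{K-L}[p(\tilde{x}|x)\|p^*(\tilde{x})] \right\rangle_{p(x)} = \min_{p(\tilde{x})} \left\langle D^{q}_{K-L}[p(\tilde{x}|x)\|p(\tilde{x})] \right\rangle_{p(x)}. \tag{28}$$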



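Proof: The positivity condition for (27) is proven in Lemma 1 of [7], with $q^*$ replacing $q$. The positivity condition for (28) is

$$\begin{aligned}
&\left\langle D^{q}_{K-L}[p(\tilde{x}|x)\|p(\tilde{x})] \right\rangle_{p(x)} - \left\langle D^{q}_{K-L}[p(\tilde{x}|x)\|p^*(\tilde{x})] \right\rangle_{p(x)} \\
&\overset{(a)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(\tilde{x}|x)}{p^*(\tilde{x})}\right) + \sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right) \\
&\overset{(b)}{=} D^{q}_{K-L}[p^*(\tilde{x})\|p(\tilde{x})] \times \left[1 + (q-1)\left\langle D^{q}_{K-L}[p(\tilde{x}|x)\|p^*(\tilde{x})] \right\rangle_{p(x)}\right] \ge 0; \quad \forall\ 0 < q^* < 1,\ q > 1.
\end{aligned} \tag{29}$$

Here, (a) invokes the additive duality, and (b) employs the $q$-deformed algebra definition for $\ln_{q^*}\left(\frac{x}{y}\right)$ from (6) [21] by multiplying and dividing by $\left[\sum_{x,\tilde{x}} p(x,\tilde{x}) + (1-q^*)\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(\tilde{x}|x)}{p^*(\tilde{x})}\right)\right]$, and establishes $p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x)$ after subjecting the term within brackets ($[\bullet]$) in (29) to the additive duality ($q = 2 - q^*$).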

The free energy $F^{q^*}_{gIB}[\bullet]$ is convex only when independently evaluated with respect to one of the convex distribution sets $\{p(\tilde{x}|x)\}$, $\{p(\tilde{x})\}$, and $\{p(y|\tilde{x})\}$. The update equations which minimize the free energy are obtained by projecting the free energy onto each convex distribution while keeping the other two arguments constant.

Theorem 1: Equations (14), (15) and (23) are satisfied at the minima of the free energy (24) for each argument of the free energy as

$$\min_{p(y|\tilde{x})}\ \min_{p(\tilde{x})}\ \min_{p(\tilde{x}|x)} F^{q^*}_{gIB}[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})]. \tag{30}$$

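Denoting the iteration level as $(\tau)$, minimization is performed independently by converging alternating iterations

$$\begin{aligned}
p^{(\tau+1)}(\tilde{x}|x) &\leftarrow \frac{p^{(\tau)}(\tilde{x}) \exp_q\left[-\beta^{(\tau)}_{gIB}(x)\, D^{q}_{K-L}[p(y|x)\|p^{(\tau)}(y|\tilde{x})]\right]}{Z^{(\tau+1)}(x,\beta_{gIB}(x))}, \\
p^{(\tau+1)}(\tilde{x}) &\leftarrow \sum_x p(x)\, p^{(\tau+1)}(\tilde{x}|x), \\
p^{(\tau+1)}(y|\tilde{x}) &\leftarrow \frac{1}{p^{(\tau+1)}(\tilde{x})} \sum_x p(x,y)\, p^{(\tau+1)}(\tilde{x}|x), \\
\text{where,} \quad \beta^{(\tau)}_{gIB}(x) &= \frac{\beta_{gIB}}{\Im^{(\tau)}_{gIB}(x)} = \frac{\beta_{gIB}}{Z^{(\tau)}(x,\beta_{gIB}(x))^{1-q^*}}.
\end{aligned} \tag{31}$$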

Proof: The outline of the proof is presented herein owing to space constraints. Defining $\tilde{F}^{q^*}_{gIB}[\bullet] = F^{q^*}_{gIB}[\bullet] + \sum_x \lambda(x)\left(\sum_{\tilde{x}} p(\tilde{x}|x) - 1\right)$, and following the procedure in Section III.B. of this paper, $\frac{\delta \tilde{F}^{q^*}_{gIB}[\bullet]}{\delta p(\tilde{x}|x)} = 0$ exactly yields (23). Minimization with respect to $p(y|\tilde{x})$ affects only $\langle D_{eff} \rangle$ in (25). Defining $\tilde{F}^{q^*}_{gIB}[\bullet] = F^{q^*}_{gIB}[\bullet] + \sum_{\tilde{x}} \tilde{\lambda}(\tilde{x})\left(\sum_y p(y|\tilde{x}) - 1\right)$ and invoking $\langle D_{eff} \rangle = I_q(X;Y) - I_q(\tilde{X};Y)$ from (26), employing (4), (11), (12), and (15), the stationarity condition $\frac{\delta \tilde{F}^{q^*}_{gIB}[\bullet]}{\delta p(y|\tilde{x})} = 0$ yields

$$p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_x p(x,y)\, p(\tilde{x}|x). \tag{32}$$

From (27), it may be shown that $p(\tilde{x})$ minimizes $I_{q^*}(X;\tilde{X})$. Since $\langle D_{eff} \rangle$ is the expectation of a generalized K-Ld, (28) is applied to demonstrate that $p(\tilde{x})$ is a minimizer of $\langle D_{eff} \rangle$. Note that the gIB update equations are not globally convergent.
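The alternating structure of the update equations can be sketched in a few lines. The code below is a simplified illustration and not the scheme of (31) verbatim: it assumes a single constant tradeoff parameter `beta` in place of the $x$-dependent $\beta^{(\tau)}_{gIB}(x)$, normalizes the encoder explicitly instead of evaluating $\Im_{gIB}(x)$, assumes $q > 1$ and a strictly positive joint distribution, and uses the hypothetical name `gib_iterate`.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1 and x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential for q != 1, zero under the Tsallis cut-off."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def kl_q(p, r, q):
    """Generalized K-Ld: -sum_i p_i ln_q(r_i / p_i)."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r) if pi > 0.0)

def gib_iterate(p_xy, enc, beta, q, sweeps=50):
    """Alternate marginal, decoder and encoder updates; enc[x][t] = p(t|x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p_xy[x][y] / p_x[x] for y in range(n_y)] for x in range(n_x)]
    for _ in range(sweeps):
        # Marginal p(t) and decoder p(y|t) from the current encoder.
        p_t = [sum(p_x[x] * enc[x][t] for x in range(n_x)) for t in range(n_t)]
        p_y_t = [[sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) / p_t[t]
                  for y in range(n_y)] for t in range(n_t)]
        # Encoder update with explicit normalization over t.
        new_enc = []
        for x in range(n_x):
            w = [p_t[t] * exp_q(-beta * kl_q(p_y_x[x], p_y_t[t], q), q)
                 for t in range(n_t)]
            z = sum(w)
            new_enc.append([wi / z for wi in w])
        enc = new_enc
    return enc

p_xy = [[0.30, 0.05], [0.25, 0.05], [0.05, 0.30]]
start = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
encoder = gib_iterate(p_xy, start, beta=5.0, q=1.5)
```

Because the iterations are not globally convergent, the returned encoder depends on the starting point `start`; in practice several initializations would be compared.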

V. Conclusions and Discussions 

Akin to the RD theory, the degree of compression may be assessed by the compression information $I_{q^*}(X;\tilde{X})$. However, while the RD method is upper bounded by an a-priori chosen optimal expected distortion $D$, the gIB method is lower bounded by the relevant information $I_{q^*}(\tilde{X};Y)$. It has been demonstrated that in lossy compression, $I_{q^*}(X;\tilde{X})$ is always lower than its counterpart obtained using B-G-S statistics [7]. This observation implies that gIB relevance-compression curves will tend to traverse the forbidden region of an equivalent IB method based on B-G-S statistics. Future work will cast the gIB model within the framework of Bregman divergences [25].